Optimal instance selection for improved decision tree

نویسندگان

  • Shuning Wu
  • John Jackman
  • Sarah Ryan
  • Di Cook
  • Dan Zhu
چکیده

Instance selection plays an important role in improving scalability of data mining algorithms, but it can also be used to improve the quality of the data mining results. In this dissertation we present a new optimization-based approach for instance selection that uses a genetic algorithm (GA) to select a subset of instances to produce a simpler decision tree with acceptable accuracy. The resultant trees are likely to be easier to comprehend and interpret by the decision maker and hence more useful in practice. We present numerical results for several difficult test datasets that indicate that GA-based instance selection can often reduce the size of the decision tree by an order of magnitude while still maintaining good prediction accuracy. The results suggest that GA-based instance selection works best for low entropy datasets. With higher entropy, there will be less benefit from instance selection. A comparison between GA and other heuristic approaches such as Rmhc (Random Mutation Hill Climbing) and simple construction heuristic, indicates that GA is able to obtain a good solution with low computation cost even for some large datasets. One advantage of instance selection is that it is able to increase the average instances associated with the leaves of the decision trees to avoid overfitting, thus instance selection can be used as an effective alternative to prune decision trees. Finally, the analysis on the selected instances reveals that instance selection helps to reduce outliers, reduce missing values, and select the most useful instances for separating classes.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Decision Tree for Technology Selection of Nitrogen Production Plants

Nitrogen is produced mainly from its most abundant source, the air, using three processes: membrane, pressure swing adsorption (PSA) and cryogenic. The most common method for evaluating a process is using the selection diagrams based on feasibility studies. Since the selection diagrams are presented by different companies, they are biased, and provide unsimilar and even controversial results. I...

متن کامل

Prosodic Boundary Prediction for Greek Speech Synthesis

In this article, we evaluate features and algorithms for the task of prosodic boundary prediction for Greek. For this purpose a prosodic corpus composed of generic domain text was constructed. Feature contribution was evaluated and ranked with the application of information gain ranking and correlation -based feature selection filtering methods. Resulted datasets were applied to C4.5 decision t...

متن کامل

Anomaly Detection Using SVM as Classifier and Decision Tree for Optimizing Feature Vectors

Abstract- With the advancement and development of computer network technologies, the way for intruders has become smoother; therefore, to detect threats and attacks, the importance of intrusion detection systems (IDS) as one of the key elements of security is increasing. One of the challenges of intrusion detection systems is managing of the large amount of network traffic features. Removing un...

متن کامل

IFSB-ReliefF: A New Instance and Feature Selection Algorithm Based on ReliefF

Increasing the use of Internet and some phenomena such as sensor networks has led to an unnecessary increasing the volume of information. Though it has many benefits, it causes problems such as storage space requirements and better processors, as well as data refinement to remove unnecessary data. Data reduction methods provide ways to select useful data from a large amount of duplicate, incomp...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2015